Working with R and RStudio

Frontmatter (Spring 2026)

I like to use this spot to publish course announcements. Not so much for y'all, but more so that I remember. If you see any announcements that don't say "Spring 2026", there's a good chance it's leftover from an earlier course offering. That's a reasonable indicator that you have jumped ahead… or… I forgot to edit something.

Participation extra credit

A careful read of our syllabus under “class participation” will show that I do give extra credit for answering questions and (mainly) sharing completed R coding tasks. That is, we’ll walk through some examples, and when we hit a box that looks like this:

Try it

Do some stuff in R

I'll ask you to give it a go. Then, after a few minutes, I'll ask if anyone wants to share their answer. Sharing earns one point toward your five required participation credits (or extra credit at 1/5th value after the first five). We use Zoom only for screen sharing – the up-arrow logo in the website header, next to the Slack logo, will take you to the sharing Zoom. You can join whenever; I'll call on folks to share using the participant list.

Assignment / Exercises

The Assignments page has all of our weekly lab assignments (including Week 1, due on Monday at 11:59pm). The assignments often have a preamble and some setup code that you need to run before starting the questions. The questions to be completed and turned in are under "Exercises" at the very end.

What I assume you know

If you have not yet installed R and RStudio, and if you have not yet successfully rendered the "weekly writing" template, then please make sure you do so today. Use the resources page for instructions on installing R, RStudio, and tinytex for rendering to LaTeX PDFs, as required in this class.

Pre-requisites for EC242 include PLS202, which covers the basics of using R and RStudio. Thus, we do not spend any class time on the workings of R. It is assumed you already know how to run a script, the basics of "object oriented programming" (how R refers to and manipulates data like vectors or data.frames, etc.), and how to put together a code file that runs an analysis.

We have created a video walkthrough of the basics of using R for another course, but it is useful here. You can see part A (labeled "Part 2a") here, and part B (labeled "Part 2b") here. You should already be at this level of familiarity with R, but if you need a review, this is a good place to start.

If you are still unsure of the basic workings of R, I have created a review resource for you here: Using R

How we use Rmarkdown

Rmarkdown lets us combine the processing and output of R code with text and headers written in plain English, so we can do something in code and then show it and discuss it in one place.

An .Rmd (Rmarkdown document) like your lab and weekly writing templates has three parts: first, a YAML header up at the top that establishes some variables for use in rendering to PDF. Second, "code chunks" that are processed by R. And third, markdown text that is processed as normal text (via markdown language). You do work in code chunks, the output is included in the document, and you discuss the results in-line. When you render your document using the "knit" button, R will construct the final PDF output by running code chunks in order and merging their output with your text. Make sure you read using Rmarkdown and using markdown before you do your first weekly reading.
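As a rough sketch, the YAML header of an .Rmd looks something like the block below (the field values are placeholders; use the header that comes with your template rather than retyping it):

---
title: "Weekly Writing 01"
author: "Your Name"
output: pdf_document
---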

The header on the code chunk tells RStudio what language to use to run the chunk (r), and can take some settings for displaying output. The one you want to know now is echo=T. When this is TRUE (the document's default), knitting will include a copy of your code along with the output. Don't change this to FALSE or I can't see your work when grading.
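For reference, here is a minimal sketch of a code chunk and its header (the chunk label my-first-chunk is just an example name):

```{r my-first-chunk, echo=TRUE}
# any R code in here runs when you knit; echo=TRUE keeps the code visible
summary(cars)   # cars is a small built-in dataset
```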

A code chunk in a .Rmd

Your R code goes here in these “chunks”. In the upper right, you’ll see a green down-pointing triangle and a green right-pointing triangle. The first one (down-pointing) runs all of the previous code chunks up to this one while the second (right-pointing) runs this code chunk.

This is very useful when you are iterating through steps to develop your code. Running a code chunk will show you the output from that code chunk, which is what will drop into your .Rmd file when you knit it. Note that you can also highlight code and use CTRL+ENTER (or CMD+ENTER for macs) to run code.

Code Flow: Your script needs to contain all the steps you have taken to complete the assignment (or create your group project, etc.). For each assignment, you will have a single script that can run “from the top” and generate your results. You will be primarily working with an Rmarkdown file, so all your R progress, from loading packages and data to the final plot or output, will be in code chunks with your written answers in between.

It is important to get a grasp on this paradigm – all code chunks are processed in order (a "flow"), and early code chunks influence later code chunks. All work has to be in the code, and it has to be in sequential order – if your 2nd code chunk refers to an object that isn't created until the 3rd chunk, you'll get an error when you knit, even though your local environment may already have that object in it.
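As a small illustration (the object name my_scores is made up for this example), imagine two chunks that knit in order:

# Chunk 1: create an object
my_scores <- c(80, 90, 100)

# Chunk 2: this works because Chunk 1 has already run
mean(my_scores)

If mean(my_scores) instead appeared in a chunk before my_scores was defined, knitting would stop with an error like "object 'my_scores' not found", even if my_scores exists in your local environment.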

When you “knit” your script, it runs from a fresh, clean, empty state, and your environment is not accessible to it. Any work you do directly into the console will not apply when you run your code from the top. You can work in the console, but it’s absolutely vital that you then copy your work to a code chunk. This will be a source of frustration while you get used to the paradigm of code flow.

If you’re unclear on the idea of R’s “workspace”, see here

install.packages()

R uses packages to add functionality. Much of R is really based on additional functionality (with "Base R" being a fairly stripped-down set of functions). As such, we'll need to install some packages. We'll state which packages are needed at the top of every unit and assignment. You need only install a package once on your computer, and (counter-intuitively, given our discussion of code flow) you should never, ever have install.packages() in your recorded code (in your "code" file if using an R script, or in your code chunks in a .Rmd). If install.packages() is in your .Rmd file, then when you "knit", it'll try to install the packages, get very confused, and throw an error. To install a package, you type (directly in the console) install.packages('packageName').

To use a package, you include (in your .Rmd, usually in the first code chunk) library(packageName).
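For example, with the dslabs package we use later on this page:

# In the console, one time only (never in a code chunk):
# install.packages("dslabs")

# In your .Rmd, usually in the first code chunk:
library(dslabs)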

Data Types and Coding Concepts

For a full review of data types, see the usingR resource. I’m going to cover some of the important, non-basic data types we’ll be using.

Functions

Once you define variables, the data analysis process can usually be described as a series of functions applied to the data. R includes several zillion predefined functions and most of the analysis pipelines we construct make extensive use of the built-in functions. But R's power comes from its extensibility: we have access to (nearly) infinite functions via install.packages and library. As we go through the course, we will carefully note new functions we bring to each problem. For now, though, we will stick to the basics.

In general, we need to use parentheses to evaluate a function. If you type ls, the function is not evaluated and instead R shows you the code that defines the function. If you type ls(), the function is evaluated and R lists the objects in your workspace.
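In other words (try this in your console):

ls     # no parentheses: R prints the code that defines the function
ls()   # with parentheses: R evaluates it and lists the objects in your workspace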

Unlike ls, most functions require one or more arguments. Below is an example of how we assign an object to the argument of the function log:

log(8)
[1] 2.079442
a = 1
log(a)
[1] 0

You can find out what the function expects and what it does by reviewing the very useful manuals included in R. You can get help by using the help function like this:

help("log")

For most functions, we can also use this shorthand:

?log

The help page will show you what arguments the function is expecting. For example, log needs x and base to run. However, some arguments are required and others are optional. You can determine which arguments are optional by noting in the help document that a default value is assigned with =. Defining these is optional.1 For example, the base of the function log defaults to base = exp(1)—that is, log evaluates the natural log by default.

If you want a quick look at the arguments without opening the help system, you can type:

args(log)
function (x, base = exp(1)) 
NULL

You can change the default values by simply assigning another object:

log(8, base = 2)
[1] 3

Note that up to now we have not been naming the argument x; naming it explicitly also works:

log(x = 8, base = 2)
[1] 3

The above code works, but we can save ourselves some typing: if no argument name is used, R assumes you are entering arguments in the order shown in the help file or by args. So by not using the names, it assumes the arguments are x followed by base:

log(8,2)
[1] 3

If using the arguments’ names, then we can include them in whatever order we want:

log(base = 2, x = 8)
[1] 3

To specify arguments, we must use =, and cannot use <-.

There are some exceptions to the rule that functions need the parentheses to be evaluated. Among these, the most commonly used are the arithmetic and relational operators. For example:

2 ^ 3
[1] 8

You can see the arithmetic operators by typing:

help("+")

or

?"+"

and the relational operators by typing:

help(">")

or

?">"
Tip

Never use ? in your code. The help operator, ?..., should only be used directly in the console. If you put it in your code, it’ll keep opening the help, and when you include it in an Rmarkdown document, it’ll behave strangely. Don’t do it!

Other prebuilt objects

There are several datasets that are included for users to practice and test out functions. You can see all the available datasets by typing:

data()

This shows you the object name for these datasets. These datasets are objects that can be used by simply typing the name. For example, if you type:

co2

R will show you Mauna Loa atmospheric \(CO_2\) concentration data.

Other prebuilt objects are mathematical quantities, such as the constant \(\pi\) and \(\infty\):

pi
[1] 3.141593
Inf+1
[1] Inf
Try it!
  1. Let’s think about how the order of code (within or between code chunks) is important: What is the sum of the first 100 positive integers? The formula for the sum of integers \(1\) through \(n\) is \(n(n+1)/2\). Define \(n=100\) and then use R to compute the sum of \(1\) through \(100\) using the formula. What is the sum?

  2. Now use the same formula to compute the sum of the integers from 1 through 1,000.

  3. Look at the result of typing the following code into R:

n <- 1000
x <- seq(1, n)
sum(x)

Based on the result, what do you think the functions seq and sum do? You can use help.

  a. sum creates a list of numbers and seq adds them up.
  b. seq creates a list of numbers and sum adds them up.
  c. seq creates a random list and sum computes the sum of 1 through 1,000.
  d. sum always returns the same number.

  4. In math and programming, we say that we evaluate a function when we replace the argument with a given number. So if we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the log, in base 10, of the square root of 100.

  5. Which of the following will always return the numeric value stored in x? You can try out examples and use the help system if you want.

  a. log(10^x)
  b. log10(x^10)
  c. log(exp(x))
  d. exp(log(x, base = 2))

Commenting your code

If a line of R code starts with the symbol #, it is not evaluated. We can use this to write reminders of why we wrote particular code. For example, in the script above we could add:

## Code to compute solution to quadratic equation of the form ax^2 + bx + c
## define the variables
a <- 3
b <- 2
c <- -1

## now compute the solution
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)

Data frames

A large proportion of data analysis challenges start with data stored in a data frame. For example, we stored the data for our motivating example in a data frame. You can access this dataset by loading the dslabs library and loading the murders dataset using the data function:

library(dslabs)
data(murders)

To see that this is in fact a data frame, we type:

class(murders)
[1] "data.frame"

Data From Packages

Woah, there – data(murders) gave me an error! Well, the data, like many functions, are part of a package. Here, it’s the dslabs package. We need to load up the package first (or install it first, as the case may be).

We can usually also refer to data in a package by referring to the package's namespace, which is the name of the package from which the data originates. So dslabs::murders is an object (from the dslabs namespace) that can be accessed that way, too. We can use a simple line of code to create the object in our environment: myMurders = dslabs::murders. Now, we have a copy of the object in our environment that we can work with.
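Putting that together as a quick sketch (myMurders is just an example name):

myMurders <- dslabs::murders   # copy the packaged data into your environment
head(myMurders)                # peek at the first few rows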

Lists

Data frames are a special case of lists. We will cover lists in more detail later, but know that they are useful because you can store any combination of different types. In a data.frame, all columns have to be vectors of the same length (equal to the number of rows in the data.frame). In a list, each item can be of any length and of any type. Below is an example of a list we created for you:

record
$name
[1] "John Doe"

$student_id
[1] 1234

$grades
[1] 95 82 91 97 93

$final_grade
[1] "A"
class(record)
[1] "list"

As with data frames, you can extract the components of a list with the accessor $. In fact, data frames are a type of list.

record$student_id
[1] 1234

We can also use double square brackets ([[) like this:

record[["student_id"]]
[1] 1234

You should get used to the fact that in R there are often several ways to do the same thing, such as accessing entries.2

You might also encounter lists without variable names.

record2
[[1]]
[1] "John Doe"

[[2]]
[1] 1234

If a list does not have names, you cannot extract the elements with $, but you can still use the brackets method and instead of providing the variable name, you provide the list index, like this:

record2[[1]]
[1] "John Doe"

We won’t be using lists until later, but you might encounter one in your own exploration of R. For this reason, we show you some basics here.

Try it!

You should be familiar with the functions and concepts used in this TRY IT. If any are new to you, please see the using R review. Consider this to be a test for retention of PLS202 concepts and methods.

  1. Install the dslabs package, load the package, and load the US murders dataset.
library(dslabs)
data(murders)

Use the function str to examine the structure of the murders object. Which of the following best describes the variables represented in this data frame?

    a. The 51 states
    b. The murder rates for all 50 states and DC.
    c. The state name, the abbreviation of the state name, the state's region, and the state's population and total number of murders for 2010.
    d. str shows no relevant information.

  2. What are the column names used by the data frame for these five variables?

  3. Use the accessor $ to extract the state abbreviations and assign them to the object a. What is the class of this object?

  4. Now use the square brackets to extract the state abbreviations and assign them to the object b. Use the identical function to determine if a and b are the same.

  5. We saw that the region column stores a factor. You can corroborate this by typing:

class(murders$region)

With one line of code, use the function levels and length to determine the number of regions defined by this dataset.

  6. The function table takes a vector and returns the frequency of each element. You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of states per region.

Coercion

In general, coercion is an attempt by R to be flexible with data types. When an entry does not match the expected type, some of the prebuilt R functions try to guess what was meant before throwing an error. This can also lead to confusion. Failing to understand coercion can drive programmers crazy when attempting to code in R since it behaves quite differently from most other languages in this regard. Let's learn about it with some examples.

We said that vectors must be all of the same type. So if we try to combine, say, numbers and characters, you might expect an error:

x <- c(1, "canada", 3)

But we don’t get one, not even a warning! What happened? Look at x and its class:

x
[1] "1"      "canada" "3"     
class(x)
[1] "character"

R coerced the data into characters. It guessed that because you put a character string in the vector, you meant the 1 and 3 to actually be character strings "1" and "3". The fact that not even a warning is issued is an example of how coercion can cause many unnoticed errors in R.

R also offers functions to change from one type to another. For example, you can turn numbers into characters with:

x <- 1:5
y <- as.character(x)
y
[1] "1" "2" "3" "4" "5"

You can turn it back with as.numeric:

as.numeric(y)
[1] 1 2 3 4 5

This function is actually quite useful since datasets that include numbers as character strings are common.

Not availables (NA)

This "topic" seems to be wholly unappreciated, and it has been our experience that students often panic when encountering an NA. NAs often appear when a function tries to coerce one type to another and encounters an impossible case. In such circumstances, R usually gives us a warning and turns the entry into a special value called an NA (for "not available"). For example:

x <- c("1", "b", "3")
as.numeric(x)
[1]  1 NA  3

R does not have any guesses for what number you want when you type b, so it does not try.

While coercion is a common cause of NAs, you'll see them in nearly every real-world dataset. Most often, you will encounter NAs as a stand-in for missing data. Again, this is a common problem in real-world datasets and you need to be aware that it will come up.

We’ll spend some time in Data Wrangling working on handling NAs. They will be the bane of your existence at one point.
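As a tiny preview, many summary functions take an na.rm argument that drops NAs before computing:

x <- c(1, NA, 3)
mean(x)                 # returns NA -- one missing value poisons the result
mean(x, na.rm = TRUE)   # returns 2 -- the NA is dropped first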

Sorting

Now that we have mastered some basic R knowledge (ha!), let’s try to gain some insights into the safety of different states in the context of gun murders.

sort

Say we want to rank the states from least to most gun murders. The function sort sorts a vector in increasing order, so we can see the murder totals from smallest to largest by typing:

library(dslabs)
data(murders)
sort(murders$total)
 [1]    2    4    5    5    7    8   11   12   12   16   19   21   22   27   32
[16]   36   38   53   63   65   67   84   93   93   97   97   99  111  116  118
[31]  120  135  142  207  219  232  246  250  286  293  310  321  351  364  376
[46]  413  457  517  669  805 1257

However, this does not give us information about which states have which murder totals. For example, we don’t know which state had 1257.

order

The function order is closer to what we want. It takes a vector as input and returns the vector of indexes that sorts the input vector. This may sound confusing so let’s look at a simple example. We can create a vector and sort it:

x <- c(31, 4, 15, 92, 65)
sort(x)
[1]  4 15 31 65 92

Rather than sort the input vector, the function order returns the index that sorts the input vector:

index <- order(x)
print(index)
[1] 2 3 1 5 4

The second entry of x is the smallest, so order(x) starts with 2. The next smallest is the third entry, so the second entry of order(x) is 3, and so on.

To use this, we just need to place it as the index to x. If we look at this index, we see why it works:

x[index]
[1]  4 15 31 65 92

This is the same output as that returned by sort(x).

How does this help us order the states by murders? First, remember that the entries of vectors you access with $ follow the same order as the rows in the table. For example, these two vectors containing state names and abbreviations, respectively, are matched by their order:

murders$state[1:6]
[1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
[6] "Colorado"  
murders$abb[1:6]
[1] "AL" "AK" "AZ" "AR" "CA" "CO"

This means we can order the state names by their total murders. We first obtain the index that orders the vectors according to murder totals and then index the state names vector:

ind <- order(murders$total)
murders$abb[ind]
 [1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT" "WV" "NE"
[16] "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI" "DC" "OK" "KY" "MA"
[31] "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC" "MD" "OH" "MO" "LA" "IL" "GA"
[46] "MI" "PA" "NY" "FL" "TX" "CA"

According to the above, California had the most murders.

If we want to re-order the whole data.frame based on the murders$total index and save it with the new order, we can use ind as the row index:

murders_ordered = murders[ind,]

This saves murders in a new data.frame called murders_ordered that is in the order defined by ind.

Usually when you’re working with your data, you’ll need to decide if you want to save as a new object (like murders_ordered), or just overwrite the original (murders = murders[ind,]). The answer depends on your workflow.

max and which.max

If we are only interested in the entry with the largest value, we can use max for the value:

max(murders$total)
[1] 1257

and which.max for the index of the largest value:

i_max <- which.max(murders$total)
murders$state[i_max]
[1] "California"

For the minimum, we can use min and which.min in the same way.
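For example (the values below line up with the sort and order output we saw above):

min(murders$total)
[1] 2
i_min <- which.min(murders$total)
murders$state[i_min]
[1] "Vermont"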

Does this mean California is the most dangerous state? In an upcoming section, we argue that we should be considering rates instead of totals. Before doing that, we introduce one last order-related function: rank.

rank

Although not as frequently used as order and sort, the function rank is also related to order and can be useful. For any given vector it returns a vector with the rank of the first entry, second entry, etc., of the input vector. Here is a simple example:

x <- c(31, 4, 15, 92, 65)
rank(x)
[1] 3 1 2 5 4

To summarize, let’s look at the results of the three functions we have introduced:

original sort order rank
      31    4     2    3
       4   15     3    1
      15   31     1    2
      92   65     5    5
      65   92     4    4
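If you want to check this yourself, one quick way is to line the three results up in a data.frame:

x <- c(31, 4, 15, 92, 65)
data.frame(original = x, sort = sort(x), order = order(x), rank = rank(x))
  original sort order rank
1       31    4     2    3
2        4   15     3    1
3       15   31     1    2
4       92   65     5    5
5       65   92     4    4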

Beware of recycling

Another common source of unnoticed errors in R is the use of recycling. We saw that vectors are added elementwise. So if the vectors don’t match in length, it is natural to assume that we should get an error. But we don’t. Notice what happens:

x <- c(1,2,3)
y <- c(10, 20, 30, 40, 50, 60, 70)
x+y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 11 22 33 41 52 63 71

We do get a warning, but no error. For the output, R has recycled the numbers in x. Notice the last digit of numbers in the output.

Try it!

For these exercises we will use the US murders dataset. Make sure you load it prior to starting.

library(dslabs)
data("murders")
  1. Use the $ operator to access the population size data and store it as the object pop. Then use the sort function to redefine pop so that it is sorted. Finally, use the [ operator to report the smallest population size.

  2. Now instead of the smallest population size, find the index of the entry with the smallest population size. Hint: use order instead of sort.

  3. We can actually perform the same operation as in the previous exercise using the function which.min. Write one line of code that does this.

  4. Now we know how small the smallest state is and we know which row represents it. Which state is it? Define a variable states to be the state names from the murders data frame. Report the name of the state with the smallest population.

  5. You can create a data frame using the data.frame function. Here is a quick example:

temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro",
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)

Use the rank function to determine the population rank of each state from smallest population size to biggest. Save these ranks in an object called ranks, then create a data frame with the state name and its rank. Call the data frame my_df.

  6. Repeat the previous exercise, but this time order my_df so that the states are ordered from least populous to most populous. Hint: create an object ind that stores the indexes needed to order the population values. Then use the bracket operator [ to re-order each column in the data frame.

  7. The na_example vector represents a series of counts. You can quickly examine the object using:

data("na_example")
str(na_example)
 int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...

However, when we compute the average with the function mean, we obtain an NA:

mean(na_example)
[1] NA

The is.na function returns a logical vector that tells us which entries are NA. Assign this logical vector to an object called ind and determine how many NAs na_example has. Note that TRUE=1 and FALSE=0 when "coerced".

Vector arithmetics

California had the most murders, but does this mean it is the most dangerous state? What if it just has many more people than any other state? We can quickly confirm that California indeed has the largest population:

library(dslabs)
data("murders")
murders$state[which.max(murders$population)]
[1] "California"

with over 37 million inhabitants. It is therefore unfair to compare the totals if we are interested in learning how safe the state is. What we really should be computing is the murders per capita. The reports we describe in the motivating section used murders per 100,000 as the unit. To compute this quantity, the powerful vector arithmetic capabilities of R come in handy.

Rescaling a vector

In R, arithmetic operations on vectors occur element-wise. For a quick example, suppose we have height in inches:

inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

and want to convert to centimeters. Notice what happens when we multiply inches by 2.54:

inches * 2.54
 [1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80

In the line above, we multiplied each element by 2.54. Similarly, if for each entry we want to compute how many inches taller or shorter than 69 inches (the average height for males) each person is, we can subtract 69 from every entry like this:

inches - 69
 [1]  0 -7 -3  1  1  4 -2  4 -2  1

Two vectors

If we have two vectors of the same length, and we sum them in R, they will be added entry by entry as follows:

\[ \begin{pmatrix} a\\ b\\ c\\ d \end{pmatrix} + \begin{pmatrix} e\\ f\\ g\\ h \end{pmatrix} = \begin{pmatrix} a +e\\ b + f\\ c + g\\ d + h \end{pmatrix} \]

The same holds for other mathematical operations, such as -, * and /.

This implies that to compute the murder rates we can simply type:

murder_rate <- murders$total / murders$population * 100000

Once we do this, we notice that California is no longer near the top of the list. In fact, we can use what we have learned to order the states by murder rate:

murders$abb[order(murder_rate)]
 [1] "VT" "NH" "HI" "ND" "IA" "ID" "UT" "ME" "WY" "OR" "SD" "MN" "MT" "CO" "WA"
[16] "WV" "RI" "WI" "NE" "MA" "IN" "KS" "NY" "KY" "AK" "OH" "CT" "NJ" "AL" "IL"
[31] "OK" "NC" "NV" "VA" "AR" "TX" "NM" "CA" "FL" "TN" "PA" "AZ" "GA" "MS" "MI"
[46] "DE" "SC" "MD" "MO" "LA" "DC"

Right now, the murder_rate object isn’t in the murders data.frame, but we know it’s the right length (why?). So we can add it:

murders$rate = murder_rate

Note that now, we have two copies of the same vector of numbers – one called murder_rate floatin’ around in our environment, and another in our murders data.frame with the column name rate. If we re-order murder_rate, it won’t affect anything in murders$rate and vice versa.

Try it!
  1. Previously we created this data frame:
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro",
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)

Remake the data frame using the code above, but add a line that converts the temperature from Fahrenheit to Celsius. The conversion is \(C = \frac{5}{9} \times (F - 32)\).

  2. Write code to compute the sum \(1 + 1/2^2 + 1/3^2 + \dots + 1/100^2\). Hint: thanks to Euler, we know it should be close to \(\pi^2/6\).

  3. Compute the per 100,000 murder rate for each state and store it in a new column called murder_rate. Then compute the average murder rate for the US using the function mean. What is the average?

Indexing

Indexing is a boring name for an important tool. R provides a powerful and convenient way of referencing specific elements of vectors. We can, for example, subset a vector based on properties of another vector. In this section, we continue working with our US murders example from before.

Subsetting with logicals

Imagine you are moving from Italy where, according to an ABC news report, the murder rate is only 0.71 per 100,000. You would prefer to move to a state with a similar murder rate. Another powerful feature of R is that we can use logicals to index vectors. If we compare a vector to a single number, it actually performs the test for each entry. The following is an example related to the question above:

ind <- murder_rate < 0.71

If we instead want to know if a value is less or equal, we can use:

ind <- murder_rate <= 0.71

Note that we get back a logical vector with TRUE for each entry smaller than or equal to 0.71. To see which states these are, we can leverage the fact that vectors can be indexed with logicals.

murders$state[ind]
[1] "Hawaii"        "Iowa"          "New Hampshire" "North Dakota" 
[5] "Vermont"      

To count how many entries are TRUE, we can use the function sum, which returns the sum of the entries of a vector; logical vectors get coerced to numeric with TRUE coded as 1 and FALSE as 0. Thus we can count the states using:

sum(ind)
[1] 5

Since ind has the same length as all of the columns in murders, it can be used as a row index. When used as a row index, it will return all the rows for which the condition was true. If we use this, leaving the column index blank (for all columns):

murders[ind,]
           state abb        region population total      rate
12        Hawaii  HI          West    1360301     7 0.5145920
16          Iowa  IA North Central    3046355    21 0.6893484
30 New Hampshire  NH     Northeast    1316470     5 0.3798036
35  North Dakota  ND North Central     672591     4 0.5947151
46       Vermont  VT     Northeast     625741     2 0.3196211

Logical operators

Suppose we like the mountains and we want to move to a safe state in the western region of the country. We want the murder rate to be at most 1. In this case, we want two different things to be true. Here we can use the logical operator and, which in R is represented with &. This operation results in TRUE only when both logicals are TRUE. To see this, consider this example:

TRUE & TRUE
[1] TRUE
TRUE & FALSE
[1] FALSE
FALSE & FALSE
[1] FALSE

For our example, we can form two logicals:

west <- murders$region == "West"
safe <- murder_rate <= 1

and we can use the & to get a vector of logicals that tells us which states satisfy both conditions:

ind <- safe & west
murders$state[ind]
[1] "Hawaii"  "Idaho"   "Oregon"  "Utah"    "Wyoming"

which

Suppose we want to look up California’s murder rate. For this type of operation, it is convenient to convert vectors of logicals into indexes instead of keeping long vectors of logicals. The function which tells us which entries of a logical vector are TRUE. So we can type:

ind <- which(murders$state == "California")
murders$rate[ind]
[1] 3.374138

%in%

If rather than an index we want a logical that tells us whether or not each element of a first vector is in a second, we can use the function %in%. Let’s imagine you are not sure if Boston, Dakota, and Washington are states. You can find out like this:

c("Boston", "Dakota", "Washington") %in% murders$state
[1] FALSE FALSE  TRUE

Note that we will be using %in% often throughout the course.

Try it!

Start by loading the library and data.

library(dslabs)
data(murders)

Note that every time you run this, you replace murders in your environment. So if you had created the murders$rate column, it's gone. But if you created the free-floating object murder_rate, that still exists (but can be overwritten).

  1. Compute the per 100,000 murder rate for each state and store it in an object called murder_rate. Then use logical operators to create a logical vector named low that tells us which entries of murder_rate are lower than 1.

  2. Now use the results from the previous exercise and the function which to determine the indices of murder_rate associated with values lower than 1.

  3. Use the results from the previous exercise to report the names of the states with murder rates lower than 1.

  4. Now extend the code from exercises 2 and 3 to report the states in the Northeast with murder rates lower than 1. Hint: use the previously defined logical vector low and the logical operator &.

  5. In a previous exercise we computed the murder rate for each state and the average of these numbers. How many states are below the average?

  6. Use the match function to identify the states with abbreviations AK, MI, and IA. Hint: start by defining an index of the entries of murders$abb that match the three abbreviations, then use the [ operator to extract the states.

  7. Use the %in% operator to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?

  8. Extend the code you used in exercise 7 to report the one entry that is not an actual abbreviation. Hint: use the ! operator, which turns FALSE into TRUE and vice versa, then which to obtain an index.

Ope!

If we get all the way to the end of Week 01b, we will likely start on Week 02a – the tidyverse. I’m re-arranging the content and still fine-tuning length.

Further help with R

If you are not comfortable with R, the earlier you seek out help, the better. Quietly letting the course pass you by because you don't know how to fix an error will do nobody any good. Attend the TA's or Prof. Kirkpatrick's office hours (see the Syllabus for times and Zoom links). Also, join the course Slack (see the front page of our course website for a link) and post questions there.

Finally, there are also primers on Rstudio.cloud that can be useful. There are many ways we can help you get used to R, but only if you reach out.

Footnotes

  1. This equals sign is the reason we assign values with <-; then when arguments of a function are assigned values, we don't end up with multiple equals signs. But… who cares.↩︎

  2. Whether you view this as a feature or a bug is a good indicator whether you’ll enjoy working with R.↩︎